This vignette walks you through the unification of all the disparate parts in the Masmr pipeline: spot calling, cell segmentation and stitching,
library(masmr)
library(imager)
library(abind)
library(ggplot2)
library(patchwork)
library(bioimagetools)
library(RBioFormats)
library(Seurat)
library(dplyr)
library(gridExtra)
library(png)
library(grid)
data_dir <- "~/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/"
out_dir <- "~/OUT"
establishParams(
imageDirs = paste0(data_dir, 'S10/'),
imageFormat = ".ome.tif",
codebookFileName = paste0(data_dir, "codebook/Jin_codebook_B.tsv"),
codebookColumnNames = paste0(rep(c(0:3), 4), '_', rep(c('cy3', 'alexa594', 'cy5', 'cy7'), each=4) ),
outDir = out_dir,
dapiDir = paste0(data_dir, "DAPI_1/"),
fpkmFileName = paste0(data_dir, "FPKM_data/Jin_FPKMData_B.tsv"),
nProbesFileName = paste0(data_dir, "ProbesPerGene.csv")
)
establishCleanSlate()
## [1] ".Random.seed" "bit_number" "cleanSlate" "clusterInfo" "cxWindow" "data_dir" "distanceMetric"
## [8] "file" "filtout" "forExclusion" "forReg" "genePalette" "hnnDistBins" "i"
## [15] "im" "imList" "imMetricName" "imMetrics" "imsub" "maxClusterSize" "metric"
## [22] "n" "out_dir" "p1" "p2" "p3" "p4" "params"
## [29] "percBlank" "probabilityBLANK" "series" "spotcalldf" "thresh" "troubleshootPlots"
params$current_fov <- 'MMStack_1-Pos_002_003'
As a consequence of running scripts from preceding chapters, we expect three subdirectories in the user-specified output directory: one for spotcalling outputs (IM), one for cell segmentation (DAPI), and one for stitching (STITCH). The outputs from these three subdirectories will be combined by function synthesiseData(), and output into a final OUT subdirectory.
synthesiseData()
spotcalldf <- data.table::fread(paste0(params$parent_out_dir, '/OUT/OUT_SPOTCALL_PIXELS.csv.gz'), data.table = F)
cowplot::plot_grid( plotlist=list(
ggplot() +
geom_hex(data=spotcalldf, aes(x=Xm, y=Ym), bins=100) +
scale_fill_viridis_c(option='turbo') +
scale_y_reverse() +
theme_minimal(base_size=14) +
xlab('X (microns)') + ylab('Y (microns)') +
theme(legend.position = 'none') +
coord_fixed() +
ggtitle('Spot density'),
ggplot() +
geom_point(data=spotcalldf, aes(x=Xm, y=Ym, colour=fov), shape='.', alpha=0.5) +
scale_colour_viridis_d(name = '', option='turbo') +
scale_y_reverse() +
theme_minimal(base_size=14) +
xlab('X (microns)') + ylab('Y (microns)') +
guides(colour=guide_legend(override.aes = list(size=5, shape=16), ncol=1)) +
coord_fixed() +
theme(legend.position = 'none') +
ggtitle('Spots by FOV')
), nrow=1)
Compared to other spatial transcriptomic custom image processing pipelines like MERLIN and STARFISH, MASMR provides the users with a plethora of useful quality control plots that can be used to evaluate the performance of the pipeline.
params$verbose = F
plotQC(
includeNewBlanks=T,
synthesisDir = paste0(params$parent_out_dir, "/OUT/")
)
## Warning: Removed 184 rows containing missing values or values outside the scale range (`geom_hex()`).
## Removed 184 rows containing missing values or values outside the scale range (`geom_hex()`).
FPKMCorrelation This plot returns the correlation between expected expression and the number of spots actually observed (bot log-transformed); strong correlations are ideal. In the title of this plot, we also report the percentage of spots that are blanks (specifically those with blank in their names), and the percentage of spots that overlap a cell mask.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/FPKMCorrelation.png'))
CosineDistanceDistribution This shows a distribution of cosine distances (COS) per gene. Lower cosine distances implies higher confidence in the decoding of that spot.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/CosineDistanceDistribution.png'))
SpotFOVDistribution This shows the distribution of spots, averaged across FOVs. This is used to check if spot coordinate in a given FOV influences its detection rate – ideally, this should not occur. This plot is less useful if only a handful of FOVs were processed.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/SpotFOVDistribution.png'))
CosineSpatialDistribution This plot shows the average cosine distances per 2D bin, across the entire tissue. Ideally, spatial location should not influence confidence in spot calls.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/CosineSpatialDistribution.png'))
BitDetectionRate This shows how the ON / OFF status for each bit influences the number of spots called. We tabulate the number of spots per gene, and then split the genes into two groups for each bit: if they have a ON (1), or an OFF (0). This provides some indication as to whether separation of 0 vs 1 is clearer / more ambiguous for certain bits. Ideally, there should be no obvious bias in detection rate for any bits.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/BitDetectionRateAll.png'))
FOVMetrics This shows some statistics averaged per FOV. Useful for determining if there are systematic failures in spot calling for certain FOVs.
knitr::include_graphics(paste0(params$parent_out_dir, '/OUT/QC/FOVMetrics.png'))
CellMetrics This shows cell-wise metrics. Below, we show average cell size (in pixels) per 2D bin.
img1 <- readPNG(paste0(params$parent_out_dir, '/OUT/QC/CellMetrics_nCounts.png'))
img2 <- readPNG(paste0(params$parent_out_dir, '/OUT/QC/CellMetrics_CellSize.png'))
p1 <- rasterGrob(img1, interpolate=TRUE)
p2 <- rasterGrob(img2, interpolate=TRUE)
grid.arrange(p1, p2, ncol = 2)
SpatialPatterns This comprises two related plots. The first comprises the SpatialPatterns themselves. As default, we typically return four or fewer spatial patterns. These plots provide a quick way for users to determine if their genes are adopting spatial patterns of expression that matches their expectations.
img1 <- readPNG(paste0(params$parent_out_dir, '/OUT/QC/SpatialPatterns_SpatialPatterns.png'))
img2 <- readPNG(paste0(params$parent_out_dir, '/OUT/QC/SpatialPatterns_GeneMembership.png'))
p1 <- rasterGrob(img1, interpolate=TRUE)
p2 <- rasterGrob(img2, interpolate=TRUE)
grid.arrange(p1, p2, nrow = 2)
Check the number of genes per cell as well:
cellExp <- data.table::fread(file = paste0(params$parent_out_dir, '/OUT/OUT_CELLEXPRESSION.csv.gz') , data.table = F)
cellExp <- cellExp[!duplicated(cellExp$V1),] # why are there duplicated entries?
rownames(cellExp) <- cellExp[,1]
cellExp[,1] <- NULL
hist(log10(1+colSums(cellExp)),
main = paste0(
'Cells x genes: ', paste( dim(cellExp), collapse=' x '),
', Median nCounts: ', median(colSums(cellExp))),
breaks=25, xlab = 'Log10 nCounts per cell');
abline(v=log10(1+median(colSums(cellExp))), col='red', lwd = 3)
Now, just as a preliminary analysis, we can read in the OUT_CELLEXPRESSION.csv.gz and process it using Seurat package to get an idea of the cells in our dataset.
# filter out top 2.5 and bottom 2.5 nuclei by area
nuclei_area <- read.csv(
paste0(params$parent_out_dir, "/OUT/OUT_CELLS.csv.gz"),
check.names = FALSE
)
q_a1 <- quantile(nuclei_area$nPixels, c(.025))
q_a2 <- quantile(nuclei_area$nPixels, c(.975))
cellIDs_a <- nuclei_area$CELLNAME[nuclei_area$nPixels <= q_a1[1]]
cellIDs_a <- nuclei_area$CELLNAME[nuclei_area$nPixels <= q_a2[1]]
gcm <- read.csv(
paste0(params$parent_out_dir, "/OUT/OUT_CELLEXPRESSION.csv.gz"),
row.names = 1,
check.names = FALSE)
gcm <- gcm[rownames(gcm) %in% cellIDs_a, ]
gcm_mat <- as.matrix(gcm)
gcm_t <- t(gcm_mat)
seur_obj <- CreateSeuratObject(
counts = gcm_t,
assay = "RNA",
min.cells = 0,
min.features = 0
)
## Warning: Data is of class matrix. Coercing to dgCMatrix.
seur_obj$orig.ident <- "OUT"
seur_obj <- seur_obj[grep("BLANK", rownames(seur_obj), ignore.case = TRUE, invert = TRUE),]
seur_obj <- subset(seur_obj, nFeature_RNA > 10)
seur_obj <- SCTransform(object = seur_obj, return.only.var.genes = FALSE, vst.flavor = "v2",
variable.features.n = 3000, seed.use = 1448145, verbose = FALSE)
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `vst.flavor` is set to 'v2' but could not find glmGamPoi installed.
## Please install the glmGamPoi package for much faster estimation.
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('glmGamPoi')
## --------------------------------------------
## Falling back to native (slower) implementation.
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
seur_obj <- RunPCA(seur_obj,features = rownames(seur_obj), verbose = FALSE, assay = "SCT")
## Warning in svd.function(A = t(x = object), nv = npcs, ...): You're computing too large a percentage of total singular values, use a standard svd instead.
print(ElbowPlot(seur_obj))
PC_dim <- 4 # based on elbow plot
DefaultAssay(seur_obj) <- "SCT"
seur_obj <- RunUMAP(seur_obj, dims = 1:PC_dim, verbose = FALSE,
min.dist = 0.1, repulsion.strength = 3, negative.sample.rate = 10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
seur_obj <- FindNeighbors(seur_obj, dims = 1:PC_dim, verbose = FALSE)
seur_obj <- FindClusters(seur_obj, verbose = FALSE, resolution = 0.8)
p0 <- DimPlot(object = seur_obj, group.by = "seurat_clusters", label = T)
# DE analysis + Heatmap
seur_obj <- PrepSCTFindMarkers(seur_obj, assay = "SCT", verbose = TRUE)
## Only one SCT model is stored - skipping recalculating corrected counts
saveRDS(seur_obj, file = paste0(params$parent_out_dir, "/OUT/masmr_seurat.rds"))
seur_obj.markers <- FindAllMarkers(seur_obj, only.pos = FALSE,
min.pct = 0.25, logfc.threshold = 0.25,
assay = "SCT", slot = "data")
## Calculating cluster 0
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
# save all markers
all_markers <- seur_obj.markers %>% group_by(cluster)
DefaultAssay(seur_obj) <- "RNA"
seur_obj <- NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
## Normalizing layer: counts
seur_obj <- ScaleData(seur_obj, features = rownames(seur_obj))
## Centering and scaling data matrix
top5_markers <- all_markers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_log2FC)
p1 <- DoHeatmap(subset(seur_obj, downsample = 1000), features = top5_markers$gene,
assay = "RNA", slot = "scale.data",
disp.min = -2.5, disp.max = 2.5)
print(p0 + p1)
🎉 Congratulations! You have just finished the MASMR pipeline!
Advanced usage examples covered in the advanced usage handbook: - Accounting for intentional overlap between FOVs. - Subsetting to a single sample if multiple samples are analyzed in a single MERFISH run.